Pursuant to 37 C.F.R. § 1.102(d) and M.P.E.P.
§ 708.02(II) (7th ed., rev. 1), applicants hereby petition
to make special the above-identified application, which 
contains claims that applicants believe are actually being 
infringed in the United States. Applicants attach hereto in 
support of the instant Petition: 

• a copy of Shoemaker et al., Nature 409:922 - 927
(2001);

• a copy of Penn et al., Nature Genetics 26:315 - 
318 (2000); and 



• the Declaration of Dr. Sharron G. Penn,

• the fee set forth in 37 C.F.R. § 1.17(h); and

• a Second Supplemental Information Disclosure
Statement (with accompanying form PTO-1449 in

Status of the Claims

The instant application was filed 
January 29, 2001, with original claims 1-20. On July 6, 
2001, applicants filed a Preliminary Amendment* adding new 
claims 21-92. Claims 1-92 are presently pending and 
have not yet been acted upon. 

Infringement 

Applicants attach hereto as Exhibit A a copy of 
Shoemaker et al., "Experimental annotation of the human
genome using microarray technology," Nature 409:922 - 927 
(15 February 2001) (the "Shoemaker reference"), which 
describes activities undertaken by Rosetta Inpharmatics, 
Inc., of Kirkland, Washington, USA. The undersigned 
attorney of record has made a rigid comparison of the 
activities described in the Shoemaker reference with 
claims 1 - 92 of the instant application; in the opinion of 
the undersigned, some of the claims, were they to issue,
would unquestionably be infringed by the activities
described by Shoemaker et al.

* The Preliminary Amendment was styled a "Second
Preliminary Amendment Under 37 C.F.R. § 1.115" in order to
distinguish it from a Preliminary Amendment earlier filed to
direct entry of the Sequence Listing.

Prior art 

Applicants file concurrently herewith a Second 
Supplemental Information Disclosure Statement, with 
accompanying PTO form 1449 (in duplicate) and cited 
references. Upon such filing, applicants will have made of 
record in the above-identified application the Shoemaker 
reference and applicants' own publication, Penn et al.,
"Mining the human genome using microarrays of open reading
frames," Nature Genetics 26:315 - 318 (November 2000) ("Penn
reference") (also attached hereto as Exhibit B), which
applicants deem the two references most closely related to
the subject matter encompassed by the claims. Applicants
further will have made of record all of the references
that are cited, in turn, by either the Shoemaker or Penn
reference.

Applicants attach hereto as Exhibit C the 
Declaration of Dr. Sharron Penn, first-named inventor of the 
instant application and first-named author of the Penn 
reference. In the Declaration, Dr. Penn declares that she 
has a good knowledge of the pertinent prior art, and that 
the Shoemaker and Penn references are the two references
deemed most closely related to the subject matter of the 
claims of the above-identified patent application. 

Conclusion 

Applicants respectfully submit that the above- 
identified patent application should be made special; grant 
of special status and an early and favorable action are 
respectfully requested. 
Experimental annotation of the human
genome using microarray technology

D. D. Shoemaker*, E. E. Schadt*, C. D. Armour, Y. D. He, P. Garrett-Engele, P. D. McDonagh, P. M. Loerch, A. Leonardson, P. Y. Lum,
G. Cavet, L. F. Wu, S. J. Altschuler, S. Edwards, J. King, J. S. Tseng, G. Schimmack, J. M. Schelter, J. Koch, M. Ziman, M. J. Marton,
B. Li, P. Cundiff, T. Ward, J. Castle, M. Krolewski, M. R. Meyer, M. Mao, J. Burchard, M. J. Kidd, H. Dai, J. W. Phillips, P. S. Linsley,
R. Stoughton, S. Scherer & M. S. Boguski

Rosetta Inpharmatics, Inc., 12040 115th Avenue N.E., Kirkland, Washington 98034, USA
* These authors contributed equally to this work 



The most important product of the sequencing of a genome is a complete, accurate catalogue of genes and their products, primarily
messenger RNA transcripts and their cognate proteins. Such a catalogue cannot be constructed by computational annotation
alone; it requires experimental validation on a genome scale. Using 'exon' and 'tiling' arrays fabricated by ink-jet oligonucleotide
synthesis, we devised an experimental approach to validate and refine computational gene predictions and define full-length
transcripts on the basis of co-regulated expression of their exons. These methods can provide more accurate gene numbers and
allow the detection of mRNA splice variants and identification of the tissue- and disease-specific conditions under which genes are
expressed. We apply our technique to chromosome 22q under 69 experimental condition pairs, and to the entire human genome
under two experimental conditions. We discuss implications for more comprehensive, consistent and reliable genome annotation,
more efficient, full-length complementary DNA cloning strategies and application to complex diseases.



The initial interpretation of a genome sequence rests upon conclusions
derived solely from bioinformatics approaches: ab initio gene
predictions, homology studies, motif analysis and other
non-experimental methods. The limitations and fallibility of
this process have been discussed, and one group has concluded
that, despite more than 17 years of research effort, precise
annotation of every gene in the human genome by computational methods
alone is still a distant goal. Bioinformatics analyses of fragmentary
experimental data have led to widely varying estimates of the
number of human genes. Comparative genomics approaches,
particularly between human and mouse, will help to identify
candidate genes and refine their structures, but cannot alone show
that a gene is active. Consequently, projects to clone and catalogue
'full-length' cDNA clones from human and mouse have been
undertaken. Although these projects may capture the complete
coding sequences of many genes in time, cDNA cloning fixes a
gene product at a particular time and under particular conditions,
and thus cannot efficiently reveal the multiform nature of a
metazoan transcriptome.

Recent work indicates that the human genome may contain fewer 
genes than anticipated, and that frequent alternative splicing
might account for much physiological complexity. This situation
makes it essential to pursue a course that efficiently yields
empirical validation of the structures of genes and simultaneously 
provides an accurate and complete catalogue of their expressed 
products (mRNA and cognate protein sequences). 

We describe a high-throughput, microarray-based experimental 
method to validate predicted exons, group the exons into genes by 
co-regulated expression and define full-length mRNA transcripts. 
The method involves the design and fabrication of 'exon arrays' 
consisting of long (50-60 bases) oligonucleotide probes derived 
from predicted exons, followed by hybridization with fluorescently
labelled cDNAs derived from specific cell lines or normal or diseased 
tissues. Absolute intensities (measuring cellular abundances) or 
intensity ratios (measuring differential expression regulation) 
from hybridized cDNAs are used to identify those probes that 
represent authentic exons under the conditions tested. In addition, 
the expression data can define gene boundaries, because adjacent 
exons that are co-regulated across many conditions are likely to be 
from the same transcript. For a higher-resolution view of gene
structure, we use 'tiling arrays' in which overlapping
oligonucleotides are designed to blanket an entire genomic region of
interest. This approach can potentially reveal exons not identified by
current gene prediction algorithms and provide information about
alternative splicing.

We applied the exon array approach to a detailed analysis of 
human chromosome 22 under 69 pairs of experimental conditions. 
Tiling arrays were used to refine the structure of new genes 
discovered by exon analysis. Finally, a preliminary analysis of the 
entire human genome using exon arrays under two experimental 
conditions demonstrated the power of being able experimentally to 
validate hundreds of thousands of exon predictions, anticipating 
the prospect of analysing the entire human genome to a depth 
similar to that achieved on chromosome 22. 

Analysis of chromosome 22q using exon arrays 

Chromosome 22 was the first human chromosome to be completely 
sequenced and subjected to exhaustive computational annotation.
It has thus served as a benchmark for new computational and
experimental methods of analysis. We designed a single ink-jet
array to monitor the 8,183 exons annotated on chromosome 22q
under diverse experimental conditions. Specifically, mRNAs from 
human cell lines and normal and diseased tissues (Fig. 1) were 
fluorescently labelled with two colours and hybridized in pairs to 69 
individual chromosome 22 exon arrays (see Methods). Figure 2a 
shows a graphical display of error-weighted log expression ratios
for all 8,183 exons across 69 condition pairs. We developed a gene 
identification algorithm that uses intensity and ratio information to 
identify exons in a local neighbourhood that are strongly correlated 
across condition pairs, and then to extend such regions by incor- 
porating other local exons with similar expression behaviour. The 
resulting 572 groups of co-regulated exons are referred to as
expression-verified genes (EVGs). Figure 2b-e shows expanded 
views of specific regions of chromosome 22. Expression data can 
be used to confirm the exons and structure of a known gene (Fig. 
2b), to identify potential false positive exon predictions (Fig. 2c), to 
merge UniGene clusters into a single gene (Fig. 2d) and to verify ab 
initio gene predictions experimentally (Fig. 2e). 

For a chromosome-wide performance summary, we compared 
our experimentally derived EVGs to the list of 545 genes annotated 
by Dunham et al. (Table 1). These annotated genes were divided
into four categories (known, related, predicted and ab initio) on the 
basis of the level of experimental support for the predictions. We 
identified 210 (85%) of the 247 known genes by analysing the 
expression data from the 69 condition pairs with our gene detection 
algorithm. The remaining 15% of known genes did not exhibit 
sufficient differential expression regulation among the conditions 
tested to enable ratio-based algorithms to verify them. We detected 
66% of the related genes and 53% of the predicted genes using our 
expression regulation criteria. The most interesting result comes 
from the 325 ab initio genes that represented pure Genscan
predictions. Dunham et al. speculated that only 100 of these predicted
transcripts would represent portions of 'real' genes, but we 
found experimental support for 185 (57%) of the genes in this 
category. 
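
The validation fractions quoted above follow directly from the counts reported in Table 1. As a minimal sketch, the arithmetic can be reproduced as below; the dictionary layout and the function name `validation_percent` are illustrative assumptions and are not part of the Shoemaker reference.

```python
# Counts as reported in Table 1: (annotated genes, expression-verified genes).
# The data structure and names here are hypothetical, for illustration only.
table1 = {
    "known": (247, 210),
    "related": (150, 99),
    "predicted": (148, 78),
    "ab initio": (325, 185),
}


def validation_percent(annotated, verified):
    """Return the verified fraction as a whole-number percentage."""
    return round(100 * verified / annotated)


percents = {name: validation_percent(a, v) for name, (a, v) in table1.items()}

for name, pct in percents.items():
    print(f"{name}: {pct}%")
```

Note that the known, related and predicted categories sum to the 545 annotated genes mentioned in the text; the 325 ab initio predictions are counted separately.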

A few of the EVGs that we detected contained more than one 
gene. This occurred when adjacent genes were co-regulated across 
the 69 experimental conditions tested. In most cases, this situation 
can be addressed by testing additional conditions or by using 
additional bioinformatics techniques (for example, open reading 
frame (ORF) analysis, identification of internal polyadenylation 
sites, and supporting expressed sequence tag (EST) and protein 
sequence data). In a few cases, a single gene was represented by more 
than one EVG, indicating possible alternative splicing. Other algo- 
rithms are being developed to address this issue. 

Applications of tiling arrays

Exon-based gene validation arrays can be limited by the fact that 
gene prediction programs perform best on 'internal' exons and not
very well on initial and terminal exons, or exons that correspond to
the 5' and 3' untranslated regions (UTRs) of mRNAs.
Oligonucleotide tiling arrays of overlapping probes (Fig. 3) can effectively
address this challenge because they are constructed without any a 
priori knowledge of the possible exon content of a genomic 
sequence. We designed tiling arrays covering both strands of various 
genomic regions on chromosome 22 defined by EVGs where the 
underlying gene structure was thought to be incomplete. 

Figure 3 shows how the tiling approach was used to refine the 
structure of the novel testis transcript described in Fig. 2e. We 
fabricated an ink-jet array that contained 60-mer probes spaced in
10-base-pair (bp) intervals across both strands of the 113-kilobase
(kb) bacterial artificial chromosome (BAC) clone containing the
EVG of interest. The array was hybridized with fluorescently labelled 
testis mRNA and the resulting probe intensities were analysed to 
determine the approximate locations of the exons within this 
region. For each exon, the hybridization data effectively reduced 
the search for the intron-exon boundaries to regions of around
20-30 bp. The exact splice junctions can generally be identified within
these narrow windows by using common rules (for example, GT-AG
consensus sequence and ORF analysis). For the gene shown in Fig. 3,
only four of the six exons were correctly predicted by Genscan. Our
results extend the 3' UTR by 450 bp and one of the internal coding 
exons by 102 bases (34 amino acids). These results were confirmed 
by polymerase chain reaction with reverse transcriptase (RT-PCR) 
and sequencing (data not shown). The mRNA (GenBank accession 
no. AF324466) derived from this validated and corrected gene is 
1,312 nucleotides long, including a 649-base 3' UTR with a
polyadenylation signal at base 1,293. It encodes a 217-residue protein
and a BLASTP search revealed only one significant match (E-value
10^-[exponent illegible]) to a predicted gene product, CG5280, from the
Drosophila genome project.
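
The GT-AG rule referred to above can be applied to a narrow search window along the following lines. This is only a sketch under stated assumptions: the function names, the 0-based coordinate convention and the toy sequence are hypothetical, and the ORF analysis that the reference also relies on is not modelled.

```python
def donor_candidates(seq, start, end):
    """Positions p in [start, end) where an intron could begin,
    i.e. where the two bases seq[p:p+2] are 'GT'."""
    seq = seq.upper()
    stop = min(end, len(seq) - 1)
    return [p for p in range(start, stop) if seq[p:p + 2] == "GT"]


def acceptor_candidates(seq, start, end):
    """Positions p in [start, end) where an exon could begin,
    i.e. where the two bases immediately upstream, seq[p-2:p], are 'AG'."""
    seq = seq.upper()
    stop = min(end, len(seq) + 1)
    return [p for p in range(max(start, 2), stop) if seq[p - 2:p] == "AG"]


# Toy sequence (an assumption, not real chromosome 22 data).
toy = "CCAGGTAAGTCC"
donors = donor_candidates(toy, 0, len(toy))
acceptors = acceptor_candidates(toy, 0, len(toy))
```

In practice the window passed as `start` and `end` would be the 20-30 bp interval suggested by the tiling data, so only a handful of candidate junctions would remain to be checked against the reading frame.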

Human genome scan using exon arrays 

To show that the approach described above can scale to survey the 
entire human genome, we used the 15 June 2000 version of the 
Ensembl human genome annotation data set
(http://www.ensembl.org/) to make 50 arrays containing 1,090,408
oligonucleotide probes representing 442,785 exons predicted by
Genscan. Fluorescently labelled cDNAs from a human lymphoma
cell line and a colorectal carcinoma cell line were hybridized to the 
arrays. Analysis of fluorescence intensities from this single pair of 
experimental conditions provided experimental evidence for 58% 
of the 78,486 Ensembl confirmed exons. We detected 34% of the 
364,299 predicted exons that did not meet the Ensembl 'confirmed' 
criteria. The false positive rate for this analysis was estimated to be 
around 5%, from an analysis of a set of negative control probes 
included on the arrays. A summary of the exons validated by this 
genome survey is given (see Methods) in Fig. 4 and a full listing is
available as Supplementary Information or from the Rosetta website 
at www.rii.com. 




Figure 1 Design and fabrication of exon arrays for the predicted exons on human
chromosome 22. Two 60-mers were selected from each of 8,183 predicted exons on
human chromosome 22q and printed on a single 1 x 3 inch array (~25,000 60-mers).
This array was hybridized with 69 pairs of RNA samples using a two-colour hybridization
technique. Each experiment was performed in duplicate with a fluor reversal to minimize
possible bias caused by the molecular structure of the Cy3 and Cy5 dyes (138 arrays in
total). Red and green spots, as shown in the expanded panels on the right, are probes
representing experimentally verified genes (groups of differentially expressed exons that
are located next to each other in the genome).
[Figure label: ~8,000 exons covering a 33-Mb region of human chromosome 22]

Figure 2 Using expression data from multiple conditions to validate exons and define
gene boundaries on chromosome 22. a, Pseudocolour image showing error-weighted
log10 expression ratios (red/green) for each of the ~8,000 exons (x-axis) across the 69
fluor-reversed experiments (y-axis). A brief description of each experiment is listed on the
right side of the image; the numbers (1-69) are reference points for the Table in the
Supplementary Information. The 15,511 probes representing the 8,183 predicted exons
are arranged linearly across the 33 Mb of chromosome 22. b, Expanded region showing a
known gene (SERPIND1, NM_000185). The experiments on the y-axis have been
clustered to emphasize how co-regulation across diverse experiments can be used to
group exons into genes. The vertical white lines indicate the boundaries predicted by our
gene finding algorithm; numbers on y-axis indicate experimental conditions. c, Expanded
region showing a set of co-regulated exons from another known gene (G22P1,
NM_001469), illustrating the detection of potential false positives (arrow) made by the
Genscan prediction program. d, Expanded region representing an EVG that collapses two
UniGene EST clusters (Hs.269963 and Hs.14587) into a single transcript. e, Expanded
region showing an EVG containing six exons that are part of a novel testis-expressed
transcript (arrows, two experiments involving testis RNA samples).

Discussion

Post-genome biology and medicine will increasingly rely on complete
and accurate catalogues of human genes, mRNAs and proteins.
This 'parts list' is currently a patchwork of mostly hypothetical
entities with varying degrees of supporting evidence. Computational
techniques for sequence annotation provide invaluable clues
to gene structure and function but experimental data will be
required to provide a full and satisfying picture. Our
microarray-based technology represents a comprehensive and consistent
approach to the simultaneous validation of gene predictions and
study of the transcriptome under any number of biologically or
medically interesting conditions. Our approach is applicable on a
genome scale and also on the scale of defining the structure of a
single, novel cDNA.

Table 1 Gene validation summary of human chromosome 22q

                     Annotation       Expression-verified    Validation
                     from ref. 2      genes (EVGs)           fraction
Known genes*         247              210                    85%
Related genes*       150              99                     66%
Predicted genes*     148              78                     53%
Ab initio genes*     325              185                    57%

EVG sequences were searched against current versions of dbEST and nr (www.ncbi.nlm.nih.gov)
and significant matches were defined as those having an E-value < 10^-[exponent illegible].
* Category definitions according to Dunham et al.

The exon-based approach is well suited to high-throughput 
screening of diverse cell types, growth conditions and disease 
states. Differential expression is an important tool for assembling 
exons into genes. We could detect differential expression for only
15% of the confirmed exons across the human genome with a single
condition pair. Clearly, larger data sets will be essential for defining 
the structures of genes, detecting rarely expressed genes and 
addressing more complex issues such as alternative splicing. In 
addition, information from the exon analysis can be used to select 
genomic regions and samples for comprehensive tiling arrays. 

Ambitious efforts to clone and sequence 'full-length' cDNAs for
the human and mouse genomes have begun with the purpose not
only of helping to validate computational gene predictions but also
of providing physical reagents for functional and structural
genomics. The comprehensive set of EVGs generated by our approach
will accelerate these efforts by allowing a more directed cloning
strategy. We also expect that hybridization data defining EVGs will
be useful in 'training' the next generation of gene prediction
algorithms, in much the same manner that sequence similarity
data enhances ab initio predictions in the current state-of-the-art
programs. In this way, the maximum value can be realized from the
intersection of computational and high-throughput experimental
biology.

Figure 3 Characterization of a novel testis transcript using tiling arrays. a, An EVG
discovered in the analysis of chromosome 22 (Fig. 2e) was localized to a 10-kb region at
one end of the insert of BAC clone AL031587. Both strands of this 113-kb genomic
interval were tiled with 60-mer probes at 10-bp steps. The tiling array was hybridized with
RNA from human testis. b, Hybridization signals corresponding to tiling probes from this
region were filtered and plotted as log10 values of the normalized signal strengths. Of the
six Genscan predicted exons in this region, two (exons 3 and 6) were at variance with the
hybridization data. c, Detailed views of tiling data showing one correctly predicted exon
and one incorrectly predicted exon. d, Typically, tiling data narrow the search window for
an intron/exon boundary to 20-30 bp. The exact splice junction is then identified using
consensus sequences (GT-AG rule) and ORF information. The exact splice junction can
also be determined by sequencing RT-PCR products.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com

© 2001 Macmillan Magazines Ltd

Our experimental method of annotating the human genome
could be rapidly reiterated for updated sequence information from
the Human Genome Project, and could easily be extended to the
genomes of other organisms. Generating exon and tiling arrays
requires only the availability of genomic sequence and exon
predictions, from which probes can be rapidly and efficiently
synthesized onto an array. The flexibility and short time scale for designing
and fabricating exon and tiling arrays using the ink-jet platform
could substantially accelerate gene discovery.

Figure 4 Whole-genome scan for validating predicted exons. a, Fifty 1 x 3-inch ink-jet
arrays were used to test 442,785 Genscan predicted exons under two conditions. For
each predicted exon, the best one or two 60-mer probes were selected, resulting in the
set of 1,090,408 probes which were distributed over 50 arrays (~25,000 60-mers per
array). The arrays also included 110,000 reverse complement probes and 48,500 control
probes. The arrays were hybridized with Cy-3 or Cy-5 labelled mRNA from two human cell
lines (Raji and Colo). Enlarged image, probes representing exons from a known gene with
alternating forward and reverse complement probes. All experiments were performed in
duplicate with a fluor reversal (100 arrays total). b, The sizes of the 24 human
chromosomes (left). c, The number of predicted exons that were experimentally verified
(red bars) for each of the chromosomes. Grey bars, number of predicted exons on each
chromosome. d, A similar analysis for the confirmed exons across the human genome.

Finally, our approach could be useful in the identification and 
analysis of genes underlying complex diseases. Genetic linkage 
studies of polygenic traits typically yield a dozen loci, each up to 
20-30 megabases long. It is feasible to construct tiling arrays across 
all loci and probe them with mRNA samples from relevant normal 
and diseased tissues to ascertain both gene content and activity. 
Such analyses may provide not only more direct routes to the 
culpable genes, but also have the potential to uncover regulatory 
mutations by observed alterations in gene activity.

Methods 

Sources of predicted exons 

To analyse chromosome 22q, we designed a single ink-jet oligonucleotide microarray to
represent 8,183 sequences that had been identified or confirmed as having coding
potential (Sanger Centre). We used two sources of information: 6,650 Genscan-predicted
exon sequences, and 3,381 validated exon sequences identified by aligning the first
complete version of the human chromosome 22 sequence with sequences from
experimentally validated transcripts. Of this set of 10,031 exons, 1,847 had coordinates
identical to those of other exons and were removed from the sequence pool. The remaining
8,183 exon sequences were subjected to an oligonucleotide design process to identify the
two best candidate probes for a given exon sequence (see below). For the whole-genome
exon scan, we designed ink-jet oligonucleotide microarrays to 442,785 predicted exons
selected from the publicly available assembled sequence in the Ensembl database as of 15
June 2000. Specifically, we selected 554,202 non-redundant sequences from an initial set of
628,635 Genscan predicted exons. We removed 111,417 more sequences from the list
after they were flagged by the RepeatMasker algorithm
(http://ftp.genome.washington.edu/cgi-bin/RepeatMasker).

Probe selection for the exon-scanning arrays

For each of the predicted exons, we selected the top two 60-mers using an algorithm that 
takes into account binding energies, base composition, sequence complexity, cross- 
hybridization binding energies and secondary structure. For exon sequences of 60 
nucleotides or less, we designed a single probe consisting of the entire exon sequence. For 
the 8,183 predicted exons on chromosome 22, 15,511 60-mers were selected and printed 
on a single array. For the whole-genome exon arrays, we selected 1,090,408 60-mers to
represent the 442,785 Genscan predicted exons from the Ensembl database. For 78,486 of
the exons annotated as 'confirmed', the reverse-complement probes were also selected and
placed next to the regular probes on the array as negative controls. 

Probe selection for tiling arrays 

In the tiling experiment described in Fig. 3, 60-mer probes were placed in 10-bp intervals 
across a 113.8-kb region of chromosome 22 containing the novel testis transcript described
in Fig. 2e (BAC clone AL031587). The reverse complements for each of the tiling probes
were also included on the array to allow transcripts on either strand to be detected. The
genomic sequences used in the tiling experiments were repeat-masked before probe 
selection but no other exclusionary filters were applied. 
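
The tiling layout described above (60-mers at 10-bp steps, with a reverse complement for every probe) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to be an already repeat-masked sequence string, and no claim is made that Rosetta's design software worked this way.

```python
# Watson-Crick complements; 'N' is kept so masked bases pass through unchanged.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def tile_probes(region, probe_len=60, step=10):
    """Yield (offset, forward_probe, reverse_complement_probe) tuples
    for every window of probe_len bases taken at the given step."""
    for offset in range(0, len(region) - probe_len + 1, step):
        forward = region[offset:offset + probe_len].upper()
        yield offset, forward, reverse_complement(forward)


def probes_per_strand(region_len, probe_len=60, step=10):
    """Number of tiling probes needed to cover one strand."""
    if region_len < probe_len:
        return 0
    return (region_len - probe_len) // step + 1


# Toy 100-base region (an assumption, not real genomic sequence).
toy_region = "ACGT" * 25
toy_probes = list(tile_probes(toy_region))
per_strand_113_8kb = probes_per_strand(113_800)
```

Under these assumptions a 113.8-kb interval needs 11,375 probes per strand, or 22,750 across both strands, before any probes are lost to repeat masking.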

Array synthesis

We synthesized the oligonucleotide arrays on 1 X 3-inch glass slides using ink-jet
technology. The phosphoramidite monomers were delivered by a standard ink-jet
printer head to defined positions on a glass surface containing exposed hydroxyl groups.
The remaining synthesis steps are similar to traditional oligonucleotide synthesis. Using
this approach, up to 25,000 different 60-mers can be synthesized on a single slide. Around
1,000 'gridline' probes (5' CCTATGTGACTGGTCGATGCTACTA 3') are placed around
the perimeter of each array. Fluorescently labelled synthetic oligonucleotides
complementary to the control probes are included in all hybridizations. Arrays based on Rosetta
designs were purchased from Agilent Technologies.

Preparation of labelled cDNA 

We used the following human cell lines: Jurkat (T lymphocyte, ATCC no. TIB-152), K562
(chronic myelogenous leukaemia, ATCC no. CCL-243), Raji (Burkitt's lymphoma, ATCC
no. CCL-86), Colo (colorectal adenocarcinoma, ATCC no. CCL-220), 293 (embryonic
kidney, ATCC no. CRL-1573.1) and HepG2 (hepatocellular carcinoma, ATCC no.
CRL-11997). Poly-A+ RNA (mRNA) was isolated from each of the cytoplasmic RNA samples as
described. The 'pool' RNA sample described in Fig. 2 contains an equal mixture of four
human cell lines (Jurkat, K562, Raji and Colo). The 41 mRNA samples from the human
tissues described in Fig. 2 were purchased from commercial sources and are described at
www.rii.com/Publications. For a single hybridization, we combined 1.5 µg of mRNA with
1.0 [unit illegible] of random 9-mers and incubated the mixture for 10 min at 70 °C, 5 min at 4 °C and
10 min at 22 °C. To this mixture we added 0.5 mM amino-allyl dUTP (Sigma A-0410),
0.5 mM dNTP, 1 X RT buffer, 5 mM MgCl2, 10 mM DTT and 200 units of Superscript
(GibcoBRL), bringing the final reverse transcription reaction volume to 40 µl. This reverse
transcription reaction was incubated for 20 min at 42 °C and the RNA was hydrolysed by
adding 20 µl EDTA + NaOH and incubating at 65 °C for 20 min. The reaction was
neutralized by adding 20 µl of 1 M Tris-HCl pH 7.6. We concentrated the resulting
amino-allyl labelled single-stranded cDNA using a Microcon-30 (Millipore), and coupled it to
Cy3 or Cy5 dye using a CyDye kit (Amersham Pharmacia Q15108). The per cent dye
incorporation and total cDNA yield were determined spectrophotometrically. Pairs of
Cy5/Cy3-labelled cDNA samples were combined and hybridized as described.

Analysis and visual display of expression data 

Array images were processed as described to obtain background noise, single channel
intensity and associated measurement error estimates. Expression changes between two
samples were quantified as log10(expression ratio), where the 'expression ratio' was taken
to be the ratio between normalized, background-corrected intensity values for the two
channels (red and green) for each spot on the array. An error model for the log ratio was
applied to quantify the significance of expression changes between two samples. The
colour displays in Fig. 2 show log10(expression ratio) as red when the red channel is
upregulated relative to the green channel, green when the red channel is downregulated
relative to the green channel, black when log10(expression ratio) is close to zero, and grey
when data from one or both of the channels for a given probe are unreliable.
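
The colour convention just described amounts to a simple mapping from the two channel intensities to a display colour. The sketch below illustrates it under stated assumptions: the function names are hypothetical, the `near_zero` cut-off of 0.1 is an arbitrary illustrative value (the reference gives no number), and the error model used to weight the ratios is not reproduced.

```python
import math


def log10_ratio(red, green):
    """log10 of the red/green ratio of normalized, background-corrected intensities."""
    return math.log10(red / green)


def display_colour(red, green, reliable=True, near_zero=0.1):
    """Map a pair of channel intensities to the colour used in the display.

    near_zero is an assumed cut-off for 'close to zero'; it is not from the source.
    """
    if not reliable or red <= 0 or green <= 0:
        return "grey"      # one or both channels unusable
    value = log10_ratio(red, green)
    if abs(value) < near_zero:
        return "black"     # no appreciable change
    return "red" if value > 0 else "green"
```

For example, a spot with red intensity 1,000 and green intensity 100 has a log10 ratio of 1 and would be drawn red, while the reverse would be drawn green.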

Identifying EVGs by co-regulation

Exons were grouped into EVGs by a two-step gene identification algorithm. First, each 
probe was assigned a similarity measure, based on the moving average (using a window 
size equal to six probes) of pair-wise Pearson correlation coefficients between the log ratios 
of probe intensities in neighbouring exons. Probes with correlation coefficients above 0.5 
in a given window were selected as seeds for EVGs. The 0.5 threshold and window size were 
determined empirically by training the model on a subset of the known chromosome 22 
genes. Second, probes neighbouring a seed region were merged into the region if the
pair-wise correlation coefficients between the neighbouring probe and the average in the seed
region exceeded 0.5. This process continued, allowing for gaps between probe pairs to
account for failed probes and/or false exon predictions (gaps were not allowed to exceed
five probes), until no probes flanking the candidate region met the significance threshold
of correlation with the exon cluster. The final exon clusters resulting from the gene
detection algorithm were identified as EVGs. Not all condition pairs (rows) were
considered in forming EVGs. Elements in a given row had to have significant P values
(≤ 0.01) to be included in the analysis. Once an EVG was formed, the colour display (as in
Fig. 2) was updated by reordering the condition pairs according to a hierarchical clustering
algorithm, as described.
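
The two-step seed-and-extend procedure can be illustrated roughly as below. This is a simplified sketch and not the authors' implementation: it assumes each probe is represented by a list of log ratios across condition pairs, interprets the windowed similarity as the mean Pearson correlation of a probe with the other probes in a six-probe window, and omits the P-value filtering of condition pairs. All function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def seed_probes(profiles, window=6, threshold=0.5):
    """Step 1: indices of probes whose mean correlation with the other
    probes in a window of `window` probes exceeds the threshold."""
    n, half = len(profiles), window // 2
    seeds = []
    for i in range(n):
        others = [j for j in range(max(0, i - half), min(n, i + half)) if j != i]
        if others and mean([pearson(profiles[i], profiles[j]) for j in others]) > threshold:
            seeds.append(i)
    return seeds


def extend_region(profiles, region, threshold=0.5, max_gap=5):
    """Step 2: grow a seed region in both directions, merging probes that
    correlate with the region's average profile and tolerating runs of up
    to max_gap non-matching probes."""
    members = sorted(region)
    for direction in (1, -1):
        i = (members[-1] if direction == 1 else members[0]) + direction
        gap = 0
        while 0 <= i < len(profiles) and gap <= max_gap:
            centre = [mean(col) for col in zip(*(profiles[m] for m in members))]
            if pearson(profiles[i], centre) > threshold:
                members.append(i)
                gap = 0
            else:
                gap += 1
            i += direction
        members.sort()
    return members
```

The 0.5 threshold, the six-probe window and the five-probe gap limit are the values stated in the text; everything else about the control flow is an assumption made to keep the example short.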

Annotation of EVGs 

Predicted transcripts for all EVGs identified from the chromosome 22 exon data across the 
69 condition pairs were formed by combining the individual exons into a single sequence. 
Each of these sequences was searched against dbEST and the NR protein databases using 
gapped BLASTN and BLASTX (www.ncbi.nlm.nih.gov), respectively, to determine the
extent to which the EVG sequences were similar to other sequence data. We declared
sequences similar if the corresponding E-value for the alignment was less than 10^-[exponent illegible], using
default parameters for gapped BLAST. BLAST results were used to determine the degree of
sequence support defining a predicted transcript. These results were also used to 
determine the degree of existing sequence support for each of the EVGs detected from the 
chromosome 22 exon arrays. 

Quantitative analysis of whole-genome exon data 

We used an intensity-based algorithm to verify predicted exons experimentally across the 
entire human genome. Specifically, we used raw intensity measurements for the forward- 
strand (FS) probes and the corresponding raw intensity measurements for the reverse- 
complement (RC) probes in conjunction with the respective standard deviations of those 
measurements to determine the significance of the FS probe intensities. We controlled for 
nonspecific cross-hybridization using RC probes, given that the reverse complement of a 
DNA sequence has equivalent sequence complexity to the forward strand sequence with 
respect to a variety of measures (such as GC content and GC trend). An exon was called 
'present' if the intensity difference between an FS probe and the RC probe had P < 0.01 in 
either the red or green channel, and if the FS probe intensity had a P < 0.01 for being above 
background in the channel where the difference was considered most significant. P values 
were calculated using a t-test applied to the difference of the mean pixel intensities and to 
the difference of the mean FS/background intensities. 
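
The 'present' call can be illustrated with the short sketch below. It assumes each measurement is summarised as a (mean, standard deviation, pixel count) triple and, to stay within the standard library, replaces the t-distribution with a normal approximation, which is only reasonable for large pixel counts. The function names and the data layout are hypothetical; the two P < 0.01 criteria are taken from the text.

```python
from math import erfc, sqrt

def one_sided_p(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Approximate P value for 'mean_a > mean_b' (normal approximation, an assumption)."""
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    if se == 0:
        return 0.0 if mean_a > mean_b else 1.0
    z = (mean_a - mean_b) / se
    return 0.5 * erfc(z / sqrt(2))

def exon_present(channels, alpha=0.01):
    """channels maps a channel name to {'fs': ..., 'rc': ..., 'bg': ...} triples."""
    # Criterion 1: FS exceeds RC significantly in at least one channel.
    p_diff = {name: one_sided_p(*c["fs"], *c["rc"]) for name, c in channels.items()}
    best = min(p_diff, key=p_diff.get)  # channel with the most significant difference
    if p_diff[best] >= alpha:
        return False
    # Criterion 2: in that same channel, FS is significantly above background.
    return one_sided_p(*channels[best]["fs"], *channels[best]["bg"]) < alpha

strong = {
    "red":   {"fs": (1000, 100, 50), "rc": (300, 80, 50), "bg": (200, 60, 50)},
    "green": {"fs": (320, 100, 50),  "rc": (300, 80, 50), "bg": (200, 60, 50)},
}
weak = {
    "red":   {"fs": (310, 100, 50), "rc": (300, 80, 50), "bg": (200, 60, 50)},
    "green": {"fs": (305, 100, 50), "rc": (300, 80, 50), "bg": (200, 60, 50)},
}
print(exon_present(strong), exon_present(weak))
```

A probe whose FS signal beats its RC partner but sits at background level fails the second criterion, which is the case the RC control alone would not catch.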

These single channel exon detection methods were applied only to those exons in which 
reverse-complement probes were designed. In the remaining cases, the significance of the 
single channel intensities was determined using the above-background criterion described 
above. We applied a correction to the detection percentages given for the predicted exons 
listed in Fig. 4, based on false positive estimates for above-background calls that were 
determined using the FS/RC probe intensity difference calls for the confirmed exons. Error 
models used in this analysis to assess ratio significance were as described^. Of the 88,374 
confirmed exons represented on the genome-wide exon arrays, 78,486 had corresponding 
RC probes. To assess the rate of false positives expected in the single-channel assessments, 
we used a similar detection procedure to determine the number of RC probe intensity 
measurements that were significantly greater than the corresponding FS probe intensity. 
Our results indicate that the false positive rate of detection using the single channel 
method was ~ 5%. 
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Mining the human genome using microarrays of open 
reading frames 
To test the hypothesis that the human genome project will 
uncover many genes not previously discovered by sequencing of 
expressed sequence tags (ESTs), we designed and produced a set 
of microarrays using probes based on open reading frames 
(ORFs) in 350 Mb of finished and draft human sequence. Our 
approach aims to identify all genes directly from genomic 
sequence by querying gene expression. We analysed genomic 
sequence with a suite of ORF prediction programs, selected 
approximately one ORF per gene, amplified the ORFs from 
genomic DNA and arrayed the amplicons onto treated glass 
slides. Of the first 10,000 arrayed ORFs, 31% are completely 
novel and 29% are similar, but not identical, to sequences in pub- 
lic databases. Approximately one-half of these are expressed in 
the tissues we queried by microarray. Subsequent verification by 
other techniques confirmed expression of several of the novel 
genes. Expressed sequence tags (ESTs) have yielded vast 
amounts of data**'^, but our results indicate that many genes in 
the human genome will only be found by genomic sequencing. 
We downloaded for analysis all human genomic clones greater than 
50 kB in size, spanning less than 10 contigs, and submitted to Gen- 
Bank between 15 May and 15 October 1999. This corresponds to 
2,354 clones, or approximately 350 Mb. After masking repetitive 
elements, we sought ORFs using three gene-finding algorithms 
developed on independent training sets: Grail (which uses a neural 
network; ref. 3), Genefinder (which uses a hidden Markov model; 
ref. 4) and DiCTion (which searches for coding regions based on 
Fourier transform methods: J. Graham, pers. comm.). Gene-find- 
ing programs are notorious for their generation of false-positive 
results. The question thus arises: are these predicted coding regions 
real? To minimize this problem we required at least two indepen- 
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dent predictions^ of a coding region as a criterion for evaluation by 
microarray analysis. We collected the consensus ORFs into putative 
gene 'bins' using an empirical criterion. We then chose the largest 
ORF from each bin that did not contain any repetitive sequence. We 
also selected all consensus ORFs greater than 500 bp. We thereby 
attempted to approximate one exon per gene, but a number of 
genes were represented by multiple elements (Fig. 3; and Fig. A, see 
http://genetics.nature.com/supplementary_info/), and it is highly 
probable that we have missed some genes. Primers were designed to 
PCR-amplify 500-bp sequences centred around the ORF of interest, 
and universal primer sequences were added to the 5' end of each 
ORF-specific primer. 
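
One way to picture the consensus-and-binning step is sketched below. The text does not state how overlapping predictions were reconciled or what the empirical binning criterion was, so the sweep over interval endpoints, the `bin_gap` distance and all function names here are assumptions made purely for illustration. The repeat filter is omitted; the two-program requirement and the 500-bp rule come from the text.

```python
def consensus_orfs(predictions, min_programs=2):
    """Return regions covered by at least min_programs gene finders.

    predictions maps a program name to half-open (start, end) intervals; each
    program's own intervals are assumed not to overlap one another.
    """
    events = []
    for intervals in predictions.values():
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))
    depth, begin, regions = 0, None, []
    for pos, delta in sorted(events):  # at equal positions, ends sort before starts
        depth += delta
        if depth >= min_programs and begin is None:
            begin = pos
        elif depth < min_programs and begin is not None:
            if pos > begin:
                regions.append((begin, pos))
            begin = None
    return regions

def select_probes(orfs, bin_gap=20000, always_keep=500):
    """Pick the largest ORF per bin, plus every ORF longer than always_keep bp.

    bin_gap is a hypothetical stand-in for the unspecified empirical criterion.
    """
    bins = []
    for orf in sorted(orfs):
        if bins and orf[0] - bins[-1][-1][1] <= bin_gap:
            bins[-1].append(orf)
        else:
            bins.append([orf])
    chosen = {max(b, key=lambda o: o[1] - o[0]) for b in bins}
    chosen |= {o for o in orfs if o[1] - o[0] > always_keep}
    return sorted(chosen)

predictions = {
    "Grail":      [(100, 400), (5000, 5200)],
    "Genefinder": [(150, 450), (90000, 90800)],
    "DiCTion":    [(5050, 5300), (90100, 90900), (200000, 200300)],
}
consensus = consensus_orfs(predictions)
print(consensus)
print(select_probes(consensus))
```

The region at 200,000 is predicted by only one program and is dropped, which is the false-positive control the two-prediction requirement is meant to provide.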

We observed a mean exon size of 229 bp and a median size of 
150 bp (n=9498). In contrast, the amplicons were designed to 
have a very narrow distribution (475±25 bp), facilitating uniform 
retention on the microarray. Thus, approximately 50% of the 
average PCR product represents a coding region. We found that 
larger exons tended to yield absolute hybridization signals of 
greater magnitude (data not shown), and, for this reason, sought 
to avoid using short exons as probes. The ORFs were PCR ampli- 
fied from genomic DNA and their sizes confirmed by agarose-gel 
electrophoresis. We successfully sequenced 65% of all PCR ampli- 
cons, confirming their identity. Most 'unsuccessful' sequences 
were due to failed PCR reactions; others yielded poor sequence 
data. The reasons for this are unclear, but may be related to the 
quality of early draft sequence including misassemblies, inclusion 
of vector and host contamination. We thus distilled 350 Mb of 
genomic DNA to 9,498 amplicons, and spotted them in duplicate 
onto treated glass slides. 
All arrayed gene element sequences were subjected to BLAST (ref. 8) 
analysis against the GenBank databases (7 May 1999; release 2.0.9): 
40% of the sequences obtained an exact match (E values<1 e"^^) with 
either an EST or a known mRNA; 29% showed some homology with a 
known EST or mRNA (E values=1 e"^ to 1 e~^^). The remaining 31% of 
the elements showed no sequence homology with GenBank sequences. It 
should be noted, however, that as our selection process does not 
favour central or terminal exons, it is likely that some of the 
apparently novel exons are parts of genes whose 3' or 5' ends have 
been captured as ESTs. 



Fig. 1 Representation of clustering of 
expression. Pictorial representation of 
the expression pattern of all verified 
sequences which showed expression 
greater than 3 in any single tissue or 
cell type. EST, non-redundant (NR) 
and SwissProt homologies are repre- 
sented above the tissues: black, 
known genes; white, novel genes; 
grey, homologues of known genes. 
Table 1 • Brain-specific transcripts identified by microarray 

Chip sequence   Homology to known EST   Protein function as ascribed by GenBank 

AP000217.1      high                    S-100 protein, β-chain, Ca2+ binding protein expressed in central nervous system 
AP000047-1      high                    unknown function 
AC006548-9      high                    similar to mouse membrane glycoprotein M6, expressed in central nervous system 
AC007245-5      high                    similar to amphiphysin, a synaptic vesicle-associated protein 
L44140-4        high                    endothelial actin-binding protein found in non-muscle filamin 
AC004689-9      high                    protein phosphatase PP2A, neuronal; downregulates activated protein kinases 
AL031657-1      high                    unknown function; contains ankyrin motif 
AC009266-2      low                     low homology to the synaptotagmin 1 protein in rat 
AP000086.1      low                     unknown, very poor homology with collagen 
AC004689-3      high                    protein phosphatase PP2A, neuronal; downregulates activated protein kinases 

The microarrays were hybridized with RNA samples obtained 
from seven tissues and three cell lines. Each mRNA was reverse- 
transcribed to Cy3-cDNA. A pool of the RNAs obtained from all 
ten sources (that is, the tissues and cell lines) was transcribed to 
Cy5-cDNA and used as a reference target. Whereas this strategy 
allowed us to survey a large number of tissues, it attenuated the 
measurement of relative gene expression, as every highly 
expressed gene in the tissue or cell type channel will be present at 
a level of at least 10% in the control channel. For this reason, we 
represented data both in terms of normalized ratios and normal- 
ized signal intensity. Of 9,498 arrayed elements on 2 chips 
(including positive and negative controls and 'failed' products), 
4,800 (51%) were expressed in at least 1 tissue or cell type. Of the 
gene elements showing significant signal, approximately 1,870 
(39%) were expressed in all 10 tissue or cell types. In the next 
most common grouping, approximately 720 (15%) were 
expressed in only a single tissue or cell type. 
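
The percentages quoted above use two different denominators, which a quick check makes explicit: the 51% is relative to all arrayed elements, while the 39% and 15% are relative to the expressed elements only. The variable names below are illustrative; the counts are those given in the text.

```python
arrayed = 9498      # elements arrayed across the 2 chips
expressed = 4800    # expressed in at least 1 tissue or cell type
in_all_ten = 1870   # expressed in all 10 tissues or cell types (approximate)
in_one_only = 720   # expressed in a single tissue or cell type (approximate)

def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

expressed_pct = pct(expressed, arrayed)     # share of all arrayed elements
all_ten_pct = pct(in_all_ten, expressed)    # share of expressed elements
one_only_pct = pct(in_one_only, expressed)  # share of expressed elements
print(expressed_pct, all_ten_pct, one_only_pct)  # prints 51 39 15
```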

We further analysed the genes expressed predominantly in a single 
tissue (Fig. 1) and found that a large fraction of them are novel, even 
in tissues that have been exhaustively studied by EST sequencing. 

We used several approaches to validate our methods. We 
included BAC AC006064 in the sequence data set, as it is known to 
contain the gene encoding glyceraldehyde-3-phosphate dehydro- 
genase (GAPDH). The algorithms selected 25 sequences from BAC 
AC006064 for spotting onto the array, of which 4 corresponded to 
GAPDH. A commercially available GAPDH cDNA was also spot- 
ted onto the array: hybridization with labelled target showed excel- 
lent agreement between expression detected by probes 
representing single exons (with an average length of 229 bp) and 
that detected by the full-length cDNA control (of 1,100 bp; data 
not shown). 

To confirm the novel gene sequences, we selectively assayed gene 
expression by PCR from a panel of commercially available cDNAs 
derived from a variety of tissue mRNAs. The primers used were 
selected from those used to generate the microarray probes. We 
selected the elements according to the following criteria: (i) they 
were previously absent from public databases as coding sequences; 
(ii) they sequenced successfully; and (iii) they yielded interesting 
tissue-specific gene-expression patterns as measured by microar- 
ray. We thereby confirmed that AL079300-1 is expressed in cardiac 
tissue, and AL031734-1, in placental tissue. Neither was found to 
be expressed by other tissues analysed by microarray. 

Clearly, not all microarray results can be confirmed by indepen- 
dent methodologies. Evidence supporting our data, however, 
exists in public databases. For example, we carried out an analysis 
of the amplicons that showed high signal only in brain, of which 
there were 82. Of the ten with the highest degree of expression, six 
are known to have a role in the central nervous system or brain 
(Table 1). We then sought matches for the sequences generating 
the greatest (normalized) signal intensities in brain, regardless of 
expression in other tissues. Of the 20 with highest expression in 
the brain, 3 were similar to the gene encoding tubulin (AC008079- 
5, AF146191-2, AC007664-4), 2 were similar to the gene encoding 
actin (AL035701-2, AL034402-1) and 5 were homologous with 
GAPD (AL035604-1, Z86090-1, AC006064-L, AC006064-K, 
AL035604-3). Expression patterns across multiple tissues were 
also confirmed. For example, sequence L29074-7, which encodes 
a ferritin heavy chain protein, is reported to be expressed in brain 
and liver^, consistent with our results obtained by microarray. 

On completing sequence analysis of chromosome 22, Dunham 
et al. predicted that it contains at least 679 genes. Our study 
analysed 49% (16 Mb) of the chromosome-22 sequence. Assum- 
ing we predicted 1 exon per gene, we found 298 genes that were 
also predicted by Dunham et al., which implies that we would 
have identified approximately 90% of the genes identified by 
Dunham et al. had we analysed the entire chromosome. We also 
found 235 additional potential genes not included in the minimal 
set of Dunham et al. Whereas some of these will be duplicates 
(generated by two ORFs from distant parts of the same gene) and 
false positives, many will be real (see example AL079300-1 above) . 
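
The 90% figure follows from a simple extrapolation, reproduced below. It assumes that genes are spread evenly along chromosome 22, so that the analysed 49% is representative of the whole; that assumption is implicit in the text rather than stated. Variable names are illustrative.

```python
genes_predicted_by_dunham = 679  # minimum gene count for chromosome 22
fraction_analysed = 0.49         # 16 Mb of the chromosome-22 sequence
shared_genes_found = 298         # genes found here that Dunham et al. also predicted

# Scale the observed overlap up to the whole chromosome.
projected_shared = shared_genes_found / fraction_analysed
projected_coverage = projected_shared / genes_predicted_by_dunham

print(round(projected_shared))          # prints 608
print(round(100 * projected_coverage))  # prints 90
```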

For each genomic clone analysed by microarray, we accumu- 
late a plethora of information. We have therefore devised a visu- 
alization tool to present this information, which we call a 
'Mondrian' (in deference to the Dutch painter; Fig. 2). A 'Mon- 
drian' of a BAC encompassing the gene encoding carbamyl phos- 
phate synthetase illustrates the high degree of confidence with 
which single exons may be used to query gene expression (Fig. A, 
see http://genetics.nature.com/supplementary_info/), whereas a 
Mondrian of another BAC (Fig. 3) illustrates the power of the 
strategy for detecting genes. 
Fig. 2 A 'Mondrian' of a 'virtual' BAC. The red line running left to right depicts 
clone sequence. Results from exon-finding programs are represented above 
the red line (blue, GeneFinder; green, Grail II; grey, DiCTion). Sequences were 
selected as described in the text, and depicted as a white bar masking out the 
red line. Information about homology of the sequence within GenBank is 
depicted above the red line, when available. Black indicates 'known' and white 
indicates unknown regions. Microarray expression data is represented below 
the red line. A colour depiction of ratio-based gene expression for expression 
in three tissues is shown in this region. Shades of green depict elevated expres- 
sion in that sample over the control and shades of red depict higher signal in 
the control sample. Darker shades of either red or green indicate higher or 
lower ratios of gene expression compared with the control. Within each bar, a 
white circle is drawn, its size proportional to the size of the signal intensity. 
The preparation of DNA microarrays by spotting gene fragments directly 
amplified from genomic DNA is a new method of gene discovery, and 
contrasts with the standard practice of using cDNA libraries for generating 
microarray probes in that the amplified fragments are thoroughly 
characterized and of uniform length, and previously unknown genes can be 
analysed for differential gene expression. We note, however, that measuring 
expression of each exon in a gene may also be an effective method for 
detecting alternate splicing and determining its tissue-specific pattern. 
Application of the technique allows discovery and characterization of genes 
in new genomic sequence in advance of isolation and cloning. Data are 
available at http://www.hmorf.com. 




Fig. 3 A study of BAC AL049839, depicted by 'Mondrian'. We selected 12 exons from this BAC, of which 10 were successfully amplified and sequenced. The ten exons we 
arrayed represent four to six genes (four known and two putative genes). The four known genes are protease inhibitors. These data show that exons selected from the 
same gene display the same expression patterns, depicted below the red line. A novel gene is also found from 86.6 kb to 88.6 kb, in agreement with all the exon-finding 
programs. Similar expression patterns and their proximity to one another suggest the two exons are common to a single gene. Red represents the kallistatin protease 
inhibitor (P29622); purple represents the serine protease inhibitor (P05154); turquoise represents α1 anti-chymotrypsin (P01011); and mauve represents 40S ribosomal 
protein (P08865). Each panel represents 25 kb. 
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Methods 

Preparation of labelled cDNA. Human mRNA samples were purchased from 
commercial sources (Clontech and ATCC): heart, brain, liver, fetal liver, pla- 
centa, lung, bone marrow, and the cell lines HeLa S3, BT 474 (human breast 
ductal carcinoma cell line) and HBL 100 (human breast cell line). Cy3-dCTP 
and Cy5-dCTP (Amersham Pharmacia Biotech (APB)) were incorporated 
into cDNA during reverse transcription as follows. Poly(A) mRNA (1 µg), 
Oligo(dT) 12-18 primer (1 µg) and random 9-mer primers (2 µg) in a volume 
of 11 µl were incubated at 70 °C for 10 min. After snap cooling on ice, the fol- 
lowing was added to the RNA to the stated final concentration and to a vol- 
ume of 20 µl: 1×Superscript II buffer, 0.01 M DTT, 100 µM dATP, 100 µM 
dGTP, 100 µM dTTP, 50 µM dCTP, 50 µM Cy3-dCTP or Cy5-dCTP, and 200 
U Superscript II enzyme. The reaction was incubated for 2 h at 42 °C. After 
incubation, the cDNA was isolated by adding 1 U Ribonuclease H and incu- 
bating for 30 min at 37 °C. The reaction was then purified using a Qiagen 
PCR cleanup column. Probe was eluted using Tris HCl (10 mM, pH 8.5). 

Hybridization. Dye incorporation and total cDNA were determined spec- 
trophotometrically. A volume of probe equivalent to 50 pmoles of Cy3 and 
Cy5 dye was dried and resuspended in 30 µl hybridization solution (50% for- 
mamide, 5×SSC, 0.2 µg/µl poly(dA), 0.2 µg/µl human Cot1 DNA, 0.5% SDS). 
Hybridizations were carried out under a coverslip, and the array placed in a 
humid oven at 42 °C overnight. Slides were washed in 1×SSC, 0.2% SDS at 55 
°C for 5 min, followed by 0.1×SSC, 0.2% SDS, at 55 °C for 20 min, then were 
briefly dipped in water and dried thoroughly under a gentle stream of nitro- 
gen. Slides were scanned on a Molecular Dynamics Generation III scanner. 

Normalization of microarray data. By 'balancing' dye load before 
hybridization we limit the amount of normalization necessary, because 
signal intensities in each channel become equivalent. Molecular Dynamics 
spotting instrumentation spots each sample in duplicate. Ratios were 
normalized as follows: the average ratio of the two hybridization signals 
from the duplicate spots is accepted for further analysis only if the dupli- 
cate ratios are within 25%. The average ratios of all the accepted data on 
each slide falls within a normalized distribution. The ratios are normal- 
ized by dividing all ratios by the average ratio of all the spots on the slide. 
This normalizes the average ratio around 1, and allows ratios to be com- 
pared between slides. Signal is normalized by dividing all signals by the 
average signal of the spots on the slide. This results in the 'average' spot 
having a normalized signal of 1. We define significant expression as signal 
three times greater than biological noise (average signal from the negative 
control, Escherichia coli genes). 
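
The normalization rules above can be sketched as follows. The text does not say how "within 25%" is measured, so this sketch assumes the two duplicate ratios must differ by no more than 25% of their mean; that interpretation, the function names and the dictionary layout are all assumptions. The division by the slide average and the threefold noise rule come from the text.

```python
def normalize_ratios(duplicate_ratios, tolerance=0.25):
    """Average duplicate spots, drop discordant pairs, and centre the slide on 1.

    duplicate_ratios maps an element name to its two spot ratios.
    """
    accepted = {}
    for name, (first, second) in duplicate_ratios.items():
        mean = (first + second) / 2
        # Assumed reading of "within 25%": difference relative to the pair mean.
        if abs(first - second) <= tolerance * mean:
            accepted[name] = mean
    slide_mean = sum(accepted.values()) / len(accepted)
    return {name: ratio / slide_mean for name, ratio in accepted.items()}

def is_significant(signal, negative_control_signals, fold=3.0):
    """Signal counts as expression if it exceeds fold x the mean negative control."""
    noise = sum(negative_control_signals) / len(negative_control_signals)
    return signal > fold * noise

slide = {"orf_a": (1.0, 1.1), "orf_b": (2.0, 2.2), "orf_c": (1.0, 2.0)}
normalized = normalize_ratios(slide)
print(sorted(normalized))                 # orf_c is rejected as discordant
print(is_significant(4.0, [1.0, 1.2]))    # 4.0 > 3 x 1.1
```

Dividing by the slide mean forces the accepted ratios to average 1 on every slide, which is what makes ratios comparable between slides.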

Microarray controls. Positive and negative controls were spotted on the 
microarray slides. The positive controls were probes encoding GAPDH and 
actin, and Human Cot-1 DNA. The negative controls were salmon sperm 
DNA and Escherichia coli DNA. 

Sequencing. All PCR products were sequenced using energy transfer dye 
terminators (APB) on the Molecular Dynamics MegaBACE using stan- 
dard protocols. 
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